
Multivariate Data Analysis in Machine Learning

Linear Regression

We use least squares to estimate the parameters of a linear model: $\hat\beta = (Z'Z)^{-1}Z'y$, where $Z$ is the design matrix (the data frame $X$ with an extra column of 1s prepended) and $y$ is the response vector. The residual vector (the deviations) is $\hat\epsilon = y - Z\hat\beta$, and the residual sum of squares is $RSS = \hat\epsilon'\hat\epsilon$.
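
As a quick illustration, here is a minimal sketch with NumPy on made-up synthetic data (the data and variable names are illustrative, not from the text): it solves the normal equations for $\hat\beta$ and computes the residuals and RSS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption: any small data set works here).
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Design matrix Z: prepend a column of 1s to X.
Z = np.column_stack([np.ones(n), X])

# Least squares estimate beta_hat = (Z'Z)^{-1} Z'y
# (solving the normal equations rather than forming the inverse explicitly).
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Residuals and residual sum of squares.
eps_hat = y - Z @ beta_hat
RSS = eps_hat @ eps_hat
print(beta_hat, RSS)
```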

  • It comes from minimizing $\sum_i (y_i - \hat y_i)^2$, where in the simple case $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$. The LS solution is unbiased: $E(\hat\beta) = \beta$ with $Cov(\hat\beta) = \sigma^2 (Z'Z)^{-1}$; similarly $E(\hat\epsilon) = 0$ with $Cov(\hat\epsilon) = \sigma^2 (I - H)$.
  • We also define the hat matrix $H = Z(Z'Z)^{-1}Z'$; $H$ is the orthogonal projection matrix onto the column space of $Z$. Then $\hat y = Z\hat\beta = Hy$.
  • We also have $Z'(I - H) = 0$, so the residual can be rewritten as $\hat\epsilon = y - \hat y = (I - H)y$, which satisfies $Z'\hat\epsilon = 0$ and $\hat y'\hat\epsilon = 0$.
  • Now $RSS = \hat\epsilon'\hat\epsilon = y'(I - H)y = y'y - y'Hy = y'y - y'Z\hat\beta$.
  • Since $y'y = (\hat y + \hat\epsilon)'(\hat y + \hat\epsilon) = \hat y'\hat y + \hat\epsilon'\hat\epsilon$ (the cross terms vanish because $\hat y'\hat\epsilon = 0$), we have $TSS = y'y - n\bar y^2 = \hat y'\hat y - n\bar y^2 + \hat\epsilon'\hat\epsilon$, i.e. $\sum_{i = 1}^n (y_i - \bar y)^2 = \sum_{i = 1}^n (\hat y_i - \bar y)^2 + \sum_{i = 1}^n \hat\epsilon_i^2$.
  • Since $R^2$ (the coefficient of determination) is defined as $R^2 = 1 - \frac{RSS}{TSS}$, we get $R^2 = 1 - \frac{\sum_{i = 1}^n (y_i - \hat y_i)^2}{\sum_{i = 1}^n (y_i - \bar y)^2} = \frac{\sum_{i = 1}^n (\hat y_i - \bar y)^2}{\sum_{i = 1}^n (y_i - \bar y)^2}$.
  • If $Z$ is not of full rank, then $\hat\beta = (Z'Z)^{-}Z'y$ is a solution, where $(Z'Z)^{-}$ is a generalized inverse (see the sketch after this list).
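
Continuing with the same kind of synthetic data as above, the following sketch (again a hypothetical NumPy snippet, not from the text) checks the hat-matrix identities, the $TSS = ESS + RSS$ decomposition and $R^2$, and uses the Moore-Penrose pseudoinverse as one choice of generalized inverse for a rank-deficient $Z$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.3, size=n)
Z = np.column_stack([np.ones(n), X])            # design matrix with intercept column

# Hat matrix H = Z (Z'Z)^{-1} Z' projects y onto the column space of Z.
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
y_hat = H @ y
eps_hat = y - y_hat

# Orthogonality: Z' eps_hat = 0 and y_hat' eps_hat = 0 (up to rounding error).
print(np.allclose(Z.T @ eps_hat, 0), np.isclose(y_hat @ eps_hat, 0))

# Decomposition TSS = ESS + RSS and the coefficient of determination R^2.
y_bar = y.mean()
RSS = eps_hat @ eps_hat
TSS = np.sum((y - y_bar) ** 2)
ESS = np.sum((y_hat - y_bar) ** 2)
print(np.isclose(TSS, ESS + RSS), 1 - RSS / TSS)

# Rank-deficient Z: a generalized inverse (here the Moore-Penrose pseudoinverse)
# still yields a solution with the same fitted values.
Z_def = np.column_stack([Z, Z[:, 1]])           # duplicate a column -> not full rank
beta_def = np.linalg.pinv(Z_def.T @ Z_def, rcond=1e-10) @ Z_def.T @ y
print(np.allclose(Z_def @ beta_def, y_hat))
```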

Principal Component Analysis

Principal components are often used as a preliminary step for regression analysis or cluster analysis.

Given a random vector $X' = [X_1, X_2, \ldots, X_p]$ with covariance matrix $\Sigma$ whose eigenvalues satisfy $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$, consider linear combinations $Y_1 = a_1'X, \ldots, Y_p = a_p'X$ for coefficient vectors $a_1, \ldots, a_p$. Then $Var[Y_i] = a_i'\Sigma a_i$ and $Cov[Y_i, Y_j] = a_i'\Sigma a_j$. The principal components (PCs) are the uncorrelated linear combinations $Y_1, Y_2, \ldots$ whose variances $Var[Y_i] = a_i'\Sigma a_i$ are as large as possible subject to $a_i'a_i = 1$. That is:

  • first PC = the linear combination $a_1'X$ that maximizes $Var[a_1'X]$ subject to $a_1'a_1 = 1$.
  • second PC = the linear combination $a_2'X$ that maximizes $Var[a_2'X]$ subject to $a_2'a_2 = 1$ and $Cov(a_1'X, a_2'X) = 0$.
  • ...
  • $i$th PC = the linear combination $a_i'X$ that maximizes $Var[a_i'X]$ subject to $a_i'a_i = 1$ and $Cov(a_i'X, a_k'X) = 0$ for all $k < i$.
  • Because of the constraint $a_i'a_i = 1$, we can write $Var[Y_i] = a_i'\Sigma a_i / a_i'a_i$, so the maximum $Var[Y_i] = \lambda_i$ is attained at $a_i = e_i$, where $\lambda_i$ is the $i$th largest eigenvalue of $\Sigma$.
  • Taking the $i$th PC with $a_i' = e_i'$, where $e_i$ is the eigenvector of $\Sigma$ associated with $\lambda_i$, gives $Var[Y_i] = \lambda_i$ and $Cov[Y_i, Y_j] = 0$ for all $j \ne i$.
  • With $a_i = e_i$ for all $i$, we have $\sum_{i = 1}^p Var[Y_i] = \sum_{i = 1}^p \lambda_i$, and since $\sum_{i = 1}^p \lambda_i = Tr(\Sigma) = \sum_{i = 1}^p \sigma_{ii}$, this sum is the total variance.
  • The proportion of total variance explained by the $i$th PC is $\frac{\lambda_i}{\sum_{k = 1}^p \lambda_k}$; a numerical sketch follows below.
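
A minimal sketch of these computations, assuming NumPy and a small made-up data matrix (all names and data here are illustrative): eigendecompose the sample covariance, sort the eigenvalues in descending order, and report the proportion of total variance per PC.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # synthetic correlated data

# Sample covariance matrix (rows are observations).
Sigma = np.cov(X, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]               # sort so lambda_1 >= ... >= lambda_p
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Total variance: sum of eigenvalues equals the trace of Sigma.
print(np.isclose(eigvals.sum(), np.trace(Sigma)))

# Proportion of total variance explained by each PC.
print(eigvals / eigvals.sum())

# PC scores: project the centered data onto the eigenvectors; their covariance
# is diagonal with the eigenvalues on the diagonal (PCs are uncorrelated).
Y = (X - X.mean(axis=0)) @ eigvecs
print(np.allclose(np.cov(Y, rowvar=False), np.diag(eigvals)))
```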